Unvisited URL Relevancy Calculation in Focused Crawling Based on Naïve Bayesian Classification

نویسندگان

  • Debashis Hati
  • Amritesh Kumar
  • Lizashree Mishra
چکیده

Vertical search engines use focused crawler as their key component and develop some specific algorithms to select web pages relevant to some pre-defined set of topics. Crawlers are software which can traverse the internet and retrieve web pages by hyperlinks. The focused crawler of a special-purpose search engine aims to selectively seek out pages that are relevant to a pre-defined set of topics, rather than to exploit all regions of the Web. Maintaining currency of search engine indices by exhaustive crawling is rapidly becoming impossible due to the increasing size of the web. Focused crawler aims to search only the subset of the web related to a specific topic, and offer a potential solution to the problem. A focused crawler is an agent that targets a particular topic and visits and gathers only a relevant, narrow web segment while trying not to waste resources on irrelevant material. As the crawler is only a computer program, it cannot determine how relevant a web page is. The major problem is how to retrieve the maximal set of relevant and quality page. In our proposed approach, we classify the unvisited URL based on visited URLs attribute score, i.e., unvisited URLs are relevant to topics or not, and then decide based on seed page attribute score. Based on score, we put “Yes” or “No” values in the table. URLs attributes are: it’s Anchor text relevancy, its

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Prioritize the ordering of URL queue in Focused crawler

The enormous growth of the World Wide Web in recent years has made it necessary to perform resource discovery efficiently. For a crawler it is not an simple task to download the domain specific web pages. This unfocused approach often shows undesired results. Therefore, several new ideas have been proposed, among them a key technique is focused crawling which is able to crawl particular topical...

متن کامل

Priority based Semantic Web Crawler

The Internet has billions of web pages and these web pages are attached to each other using URL(Uniform Resource Allocation). Web crawler is a main module of Search engine that gathers these documents from WWW. Most of the web pages present on Internet are active and changes periodically. Thus, Crawler is required to update these web pages to update database of search engine. In this paper, pri...

متن کامل

Design of Improved Web Crawler By Analysing Irrelevant Result

A key issue in designing a focused Web crawler is how to determine whether an unvisited URL is relevant to the search topic. Effective relevance prediction can help avoid downloading and visiting many irrelevant pages. In this module, we propose a new learning-based approach to improve relevance prediction in focused Web crawlers. For this study, we chose Naïve Bayesian as the base prediction m...

متن کامل

Feature-based Malicious URL and Attack Type Detection Using Multi-class Classification

Nowadays, malicious URLs are the common threat to the businesses, social networks, net-banking etc. Existing approaches have focused on binary detection i.e. either the URL is malicious or benign. Very few literature is found which focused on the detection of malicious URLs and their attack types. Hence, it becomes necessary to know the attack type and adopt an effective countermeasure. This pa...

متن کامل

Focussed crawling of environmental web resources: A pilot study on the combination of multimedia evidence

This work investigates the use of focussed crawling techniques for the discovery of environmental multimedia Web resources that provide air quality measurements and forecasts. Focussed crawlers automatically navigate the hyperlinked structure of the Web and select the hyperlinks to follow by estimating their relevance to a given topic, based on evidence obtained from the already downloaded page...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2010